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Abstract —Head-related impulse responses (HRIRs) are 
subject-dependent and direction-dependent filters used in spatial 
audio synthesis. They describe the scattering response of the 
head, torso, and pinnae of the subject. We propose a structural 
factorization of the HRIRs into a product of non-negative and 
Toeplitz matrices; the factorization is based on a novel extension 
of a non-negative matrix factorization algorithm. As a result, the 
HRIR becomes expressible as a convolution between a direction- 
independent resonance filter and a direction-dependent reflection 
filter. Further, the reflection filter can be made sparse with 
minimal HRIR distortion. The described factorization is shown 
to be applicable to the arbitrary source signal case and allows 
one to employ time-domain convolution at a computational cost 
lower than using convolution in the frequency domain. 

Index Terms —Head-related impulse response, non-negative 
matrix factorization, Toeplitz, convolution, sparsity 

I. Introduction 

The human sound localization ability is rooted in subcon¬ 
scious processing of spectral acoustic cues that arise due 
to sound scattering off the listener’s own anatomy. Such 
scattering is quantified by a linear, time-invariant, direction- 
dependent filter known as the Head-Related Transfer Function 
(HRTF) m. HRTF knowledge allows presentation of realistic 
virtual audio sources in a Virtual Auditory Display (VAD) 
system so that the listener perceives the sound source as ex¬ 
ternal to him/her and positioned at a specific location in space, 
even though the sound is actually delivered via headphones. A 
number of additional effects such as environmental modeling 
and motion tracking are commonly incorporated in VAD for 
realistic experience tZl. l3l. 

The HRTF is typically measured by a placing a small 
microphone in an individual’s ear canal and making a record¬ 
ing of a broadband test signaf] emitted from a loudspeaker 
positioned sequentially at a number of points in space. The 
HRTF is the ratio of the spectra of microphone recording at 
the eardrum and at the head’s center position in the absence 
of the individual. Thus, the HRTF is independent of the 
test signal and the recording environment and describes the 
acoustic characteristics of the subject’s anthropometry (head, 
torso, outer ears, and ear canal). The inverse Fourier transform 
of HRTF is the (time domain) filter’s impulse response, called 
the Head-Related Impulse Response (HRIR). 

Yuancheng Luo, Dmitry N. Zotkin, and Ramani Duraiswami are with 
the Perceptual Interfaces and Reality Lab at the University of Maryland 
Institute for Advanced Computer Studies in College Park, 20742 USA, e- 
mail: yluol@umd.edu, dz@umiacs.umd.edu, ramani@umiacs.umd.edu. 

1 Various test signals, such as impulse, white noise, ML sequence, Golay 
code, frequency sweep, or any broadband signal with sufficient energy in the 
frequencies of interest can be used for the measurements. 


The primary goal of the current work is to find a short and 
sparse HRIR representation so as to allow for computation¬ 
ally efficient, low latency time-domain convolution between 
arbitrary (long) source signal y and short HRIR x El, 0. it 
is expected that direct convolutioi£] with short and sparse x 
would be more efficient w.r.t. latency and cost than frequency- 
domain convolution using the fast Fourier transform (FFT^] 
0 , 0 . 

Somewhat similar approaches has been explored in the 
literature previously. In the frequency domain, the HRTF has 
been decomposed into a product of a common transfer function 
(CTF) and a directional transfer function (DTF) 0 , 0 , (9), 
where the CTF is the minimum-phase filter with magnitude 
equal to average HRTF magnitude and the DTF is a residual. A 
more recent work on Pinna-Related Transfer Function (PRTF) 
OH, iHTl . fl2l , fl3l provided successful PRTF synthesis 
model based on deconvolution of the overall response into 
ear-resonance (derived from the spectral envelope) and ear- 
reflection (derived from estimated spectral notches) parts. The 
novelty of the current work is that the time-domain modeling 
is considered and constraints are placed on ’’residual impulse 
response” (the time-domain analog of the DTF) to allow for 
fast and efficient real-time signal processing in time domain. 
Further, the tools to achieve this decomposition (semi-non- 
negative matrix factorization with Toeplitz constraints) are 
novel as well. 

II. Problem Formulation 

We propose the following time-domain representation of an 
HRIR x G M m given by 

X & f * g, g > 0, (1) 

where * is the linear convolution operation, / € jjM-if+i 
is a “common impulse response” derived from the subject’s 
HRIR set, and g G R K is a sparse non-negative “residual”; 
the length of g is K. In analogy with terms commonly 
used in PRTF research, hereafter / is called the “resonance 
filter” and g the “reflection filter”. The resonance filter is 
postulated to be independent of measurement direction (but of 
course is different for different subjects), and the directional 
variability is represented in g , which is proposed to represent 
instantaneous reflections of the source acoustic wave off 
the listener’s anatomy; hence, g is non-negative and sparse. 
The computational advantage of such a representation is the 

2 (x *y)i = yv Xjyi-j -|_i for x and y zero-padded as appropriate 
3 Fourier Transform convolution x*y = {T {.x} o T {y}} for Fourier 

transform operator T {} and element-wise product o. 
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ability to perform efficient convolution with an arbitrary source 
signal y via the associative and commutative properties of the 
convolution operation given by 

y * X = (y * f) * g = {y * g) * f. (2) 

If y is known in advance, the convolution with / is direction- 
independent and can be precomputed in advance. Thereafter, 
direct time-domain convolution with a short and sparse g is 
fast and can be performed in real time. Moreover, even in the 
case of streaming y, computational savings are possible if the 
output signal has to be computed for more than one direction 
(as it is normally the case in VAD for trajectory interpolation). 

To learn the filters / and g , we propose a novel extension 
of the semi-non-negative matrix factorization (semi-NMF) 
method G3I. Semi-NMF factorizes a mixed-signed matrix 
X « FG t E R MxN into a product of a mixed-signed matrix 
F and a non-negative matrix G minimizing the approximation 
error in the least-squares sense. We modify the algorithm so 
that the matrix F has Toeplitz structure ; then, FG T is nothing 
but a convolution operation with multiple, time-shifted copies 
of / placed in columns of F (see Fig. [If. Thus, the overall 
approach for computing / and g is as follows: a) form matrix 
X from individual HRIRs, placing them as columns; b) run 
Toeplitz-constrained semi-NMF on X ; c) take the first column 
and row of F as /; and d) for each direction, obtain non¬ 
negative g by taking a corresponding row of G. 



III. Semi-non-negative Toeplitz Matrix 
Factorization 


A. Background 

The original non-negative matrix factorization (NMF) ca 
was introduced in the statistics and machine learning literature 
as a way to analyze a collection of non-negative inputs X in 
terms of non-negative matrices F and G where X « FG T . 
The non-negativity constraints have been used to apply the 
factorization to derive novel algorithms for spectral clustering 
of multimedia data ri7) . Semi-NMF 03 is a relaxation of the 
original NMF where the input matrix X and filter matrix F 
have mixed sign whereas the elements of G are constrained 
to be non-negative. Formally, the input matrix X E R MxN is 
factorized into matrix F G R MxK and matrix G G R NxK by 
minimizing the residual Frobenius norm cost function 

min \\X - FG t \\ 2 f = tr ((. X - FG T ) T {X - FG T )), ( 3 ) 


where tr() is the trace operator. For N samples in the data 
matrix X , the i th sample is given by the M-dimensional row 
vector Xi = X.^ and is expressed as the matrix-vector product 
of F and the K -dimensional row vector Gi = G^ : . The 
number of components K is selected beforehand or found 
via data exploration and is typically much smaller than the 
input dimension M. The matrices F and G are jointly trained 
using an iterative updating algorithm m that initializes a 
randomized G and performs an iterative loop computing 


FXG(G T G)~ 1 , 


Gij <— Gij 


\ 


(■ X T F)± + [G(F T F)~]ij 
{XTF)- j + [G{F^F )+] ij 1 


+ I Qij (Q)~ — Qjd. 

' i j 2 l o 2 


(4) 


The positive definite matrix G T G G R KxK in Eq. [ 4 ] is small 
(fast to compute) and the entry-wise multiplicative updates for 
G ensure that it stays non-negative. The method converges 
to the optimal solution that satisfies Karush-Kuhn-Tucker 
conditions fl4l as the update to G monotonically decrease 
the residual in the cost function in Eq. [3] for a fixed F, and 
the update to F gives the optimal solution for the same cost 
function for a fixed G. 


Fig. 1. Modified semi-non-negative matrix factorization generalizes time- 
domain convolution for a collection of HRTFs X, resonance filter /, and 
non-negative reflection filters in G. 


The paper is organized as follows. In section [IITJ the modi¬ 
fied semi-NMF algorithm is derived, with further extension 
to enforce a sparseness constraint on G by formulating it 
as a regularized L\ norm non-negative least squares problem 
(Li-NNLS) HB. As the cost of time-domain convolution is 
proportional to the number of non-zero (NZ) elements in g, 
decreasing K (i.e., increasing sparsity) reduces computational 
load at the cost of increased approximation error. Experimental 
results are presented in section [IV] along with the discussion. 
Finally, section |Vl| concludes the paper. 


B. Notational Conventions 


To modify semi-NMF for learning the direction-independent 
/ and a set of direction-dependent g , we introduce the follow¬ 
ing notation. Assume that F is a Toeplitz-structured matrix 
and Fij = &i-j for parameters 0 = [@i_m? • • • ? ©k-i] t ; 
thus, all entries along diagonals and sub-diagonals of F are 
constant. Hence, the Toeplitz structure is given by 


Top (0) 


©o ©1 

0-i ©o ©i 

©2 — M • • • ©-1 

©1 — M ©2 — M 


©if-2 ©if-1 
©if-2 


©0 ©1 

0-i ©o 

( 5 ) 
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and is fully specified by parameters {©o,..., &k-i} and 
{@o,..., ©i— m} along the first row and column. The Toeplitz 
matrix can also be represented indirectly as a linear combina¬ 
tion of the parameters weighted by shift matrices S k G M m x k 
as 

K -1 

F= Y. SkQk ’ S ij=kj-k- ( 6 ) 

k=l—M 


An arbitrary matrix F can be approximated by its nearest 
Toeplitz matrix F, which is defined as the minimizer of the 
residual Frobenius norm cost function given by 

2 


J = 


F-F 




F t F - 2 F t F ■ 


F t F 


dJ 

dQk 


= 2tr (F - F) 1 


dF 

dQ k 


OF 

dQk 


= S k 


(7) 


where the partial derivatives of J w.r.t. 0 k are linearly 
independent due to the trace term. By equating the derivatives 
to zero, the solution 0 is given by 


tr (F T S k ) 

min (A; + M,K -k,K,M)' 


( 8 ) 


Hence, a Toeplitz approximation F to an arbitrary matrix F 
is obtained simply by taking the means of the subdiagonals of 
F. 


C. Toeplitz-Constrained Semi-NMF 

Assuming that a solution of the factorization problem F 
has in fact Toeplitz structure as per Eq. [6} the cost function 
in Eq. [3] is quadratic (convex) w.r.t. each ©& and the set of 
parameters © has a unique minimizer. The partial derivatives 
of the cost functiorj^] are given by 
2 


d 


x-fg t 


dtr 


((X - FG t ) t {X - FG T )^j 


dQ, 


= 2tr 


dQk 


M—l 



(9) 


S 1 ©, I - S k XG 


where the product of shift matrices S kT S l can be expressed as 
the square shift matrix S' 1-1 *. To solve for the set of parameters 
©, one needs to set the partial derivatives to zero, which yields 
a linear equation AQ = b where A G Rl 0 l x l 0 l, |©| = M + 
K — 1 is a Toeplitz square matrix, and b G M m x 1 is a vector 
specified as 

A M +KM+i = ^ ( G T GS i ~ k ) , b M +k = tr (> T XC) . 

( 10 ) 


For positive-definite A, the matrix F is given by the linear 
equation solution: 

F = Top (©), <d = A~ 1 b, (11) 

which is the unique minimizer of Eq. [3] Thus, to en- 
force Toeplitz structure on F, the iterative update F G- 


4 Unlike the case considered in section 
[9] are linearly dependent. 


III-B 


the partial derivatives in Eq. 


XG{G T G)~ 1 in the semi-NMF algorithm (Eq. [4| is replaced 
by computing F as prescribed by Eq. [lO] and Eq. ITT" 


Note that to perform a convolution between / and g (i.e., 
to reconstruct the HRIR) one needs to further constrain the 


in order to fulfill the 


Toeplitz matrix F given in Eq. [ 5 ] 
filter length requirements. Such convolution is equal to the 
constrained Toeplitz matrix-vector product 


©o 0 ... 0 

© —! ©0 0 


Xi = 


: ... 0 
©L-M • • • ©-1 ©0 

0 ©X-M • • • 0-1 


Gn 

GiK 


0 ... 0 Qk-m 

( 12 ) 

where the parameters {@k-m-i, • • • ? ©i-m, © 1 , • • •, ©k} 
are set to zero. Only the NZ parameters {© 0 ,...©x-m} are 
solved for in a smaller (M — K + 1) x (M — K + 1) sized 
linear system as per Eq. [lO] and Eq. |TT] These NZ parameters 
form the resonance filter /: 

f = {Q 0 ,...Qk-m} gR m ~ k+1 . (13) 


D. Minimizing the Number of Reflections 

To introduce sparsity, we restrict the number of NZ entries 
(NNZE) in G. In order to do that, we fix the trained resonance 
filter F and solve for each reflection filter g = Gi separately 
in a penalized Li-NNLS problem formulation fl8l given by 

mm\\V(FGf+ s.t > 0, (14) 

Gi 

where D G 1Z M * xM is some transformation of the residua^] 
Three transformations are considered. 

1. The identity transform Vi = I G R MxM , which 
directly minimizes the residual norm while penalizing large 
magnitudes in the reflection filter Gi. 

2. The convolution transform 

T>c = Top (0 C ) Gl MxM 

Q?:M -1 = W :M- 1), eg!_ M = K(0 : 1 - M), 

(15) 


which is characterized by the Gaussian filter Af a (x) = 

i — X<1 

2ct2 • This transform effectively low-passes the recon¬ 
structed HRIR. It is equivalent to windowing the frequency- 
domain residuals with a Gaussian filter of inverse bandwidth; 
hence, the low-frequency bins are weighted heavier in the 
reconstruction error. 

3. The window transform 


XV = diag(^(0:M —1)) eK MxM , (16) 


5 A free Matlab solver for L i-NNLS is available online at http://www. 
stanford.edu/~boyd/papers/ll_ls.html 

0 Convolution in time domain is equivalent to windowing in frequency 
domain, and vice versa. 
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Training Error SD 





Fig. 2. RMSE / SD error progress over 25 algorithm iterations. 


where v a (x) = e~^ is a Gaussian-like filter. The window 
transform has the effect of convolving the signal spectrum with 
a filter v a (x) as if both were time series, which is equivalent 
to windowing HRIR in time domain by the Gaussian filter 
of inverse bandwidth. In this way, the earlier parts of the 
reconstructed HRIR contribute to the reconstruction error to 
the larger extent. 

The additional regularization term A in Eq. [14] affects the 
sparsity of g as increasing A decreases the NNZE. In our 
practical implementation, we also discard elements that are 
technically non-zero but have small (< 10 -4 magnitude) as 
they contribute little to the reconstruction. The final algorithm 
for learning the resonance and reflection filters with the 
sparsity constraint on the latter is summarized in Algorithm [T] 

Algorithm 1 Modified Semi-NMF for Toeplitz Constraints 
Require: Filter length K , transformation matrix D E 
HRIR matrix X E R MxN , max-iterations T 


Ha, ED, El, El. We pre-process the data as follows: a) 
convert HRIR to min-phase; b) remove the initial time delay 
so that the onset is at time zero; and c) normalize each HRIR 
so that the absolute sum over all samples is equal to unity. 

As mentioned previously, our processing intends to separate 
the arbitrary impulse response collection of into “resonance” 
(direction-independent) and “reflective” (direction-dependent) 
parts. For the HRIR, we believe that these may correspond 
to pinna/head resonances and instantaneous reflections off the 
listener’s anthropometry, respectively. Such an approach may 
also be applicable to other IR collections; for example, room 
impulse responses 1241 may be modeled as a convolution 
between a shared “resonance” filter (i.e. long reverberation 
tail) and the “reflective” filter (early sound reflections off 
the walls). In order to obtain a unique decomposition using 
Algorithm [T] one would need to have the number of directional 
IR measurements larger than the IR filter length, which may 
be impractical. This topic is a subject of future research. 


1 

2 

3 

4 

5 

6 

7 

8 


G 4— rand (TV, K ) \\ Random initialization 

for t = 1 to T do 


0 4— A~ x b \\ 
F 4— Top (0) 
Update G. \\ 

end for 

Fine-tune G. \\ 

return F , G 


Solve for resonance via EqfL 
\\ Toeplitz matrix via Eqs. 


10 


12 


Multiplicative update via Eq. |4| 


11 


13 


Vary A, a in Eqs. 14 16 15 


IV. Results 

A. HRIR/HRTF Data Information 

We have performed an extensive series of experiments 
on the data from the the well-known CIPIC database DU; 
however, the approach can be used with arbitrary HRTF data 


B. Error Metric 

For evaluation, we consider two error metrics - the root- 
mean square error (RMSE) and the spectral distortion (SD), 
representing time-domain and frequency-domain distortions 
respectively: 



(x - fg t \ 

2 

RMSE = \ 

\ j 

F 

MN 

sd (ffk'i,irfr'>) =, 

i M ( 

i= 1 \ 

10 m) 


where Xj is the reference HRIR, FGj is the reconstruction of 
it, = T{Xj} is the reference HRTF, Xj is the reference 
HRIR, and = T jFGj j is the HRTF reconstruction. 
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Min-phase HRIRs 



-2 0 2 
Azimuth 

Sparse Min-phase HRIRs 



-2 0 2 
Azimuth 



Fig. 3. Top row: Slices of reflection filter matrix G trained without sparsity constraint; also, original HRIR after min-phase processing, time delay removing, 
and normalization. Bottow row: Slices of reflection filter matrix G trained with sparsity constraint applied (A = 10 -3 ); also, HRTR reconstructed from it. 


Another feasible comparison is validation of the recon¬ 
struction derived from sparse representation (Eq. |T4]) against 
the naive regularized least squares (Li-LS) approximation of 
HRIR Xi given by 

vam.\\V {x - Xi)\\l + \\x\ x , (18) 

X 

where x G M Mx1 (i.e. magnitude-constrained approximation 
without non-negativity constraint). The difference between SD 
error of Li-NNLS approximation and of Li-LS approxima¬ 
tion is a metric of advantage provided by our algorithm in 
comparison with LS HRIR representation, which retains large- 
magnitude HRIR components irrespective of their sign. 

C. Resonance and Reflection Filter Training 

The resonance and reflection filters / and G are jointly 
trained via Algorithm [T] for 50 iterations for N = 1250 number 
of samples, M = 200 time-bins, and K = 25 filter length 
using left-ear data of CIPIC database subject 003. N and M 
here are fixed (they are simply the parameters of the input 
dataset). The choice of K is somewhat arbitrary and should 
be determined experimentally to obtain the best compromise 
between computational load and reconstruction quality. Here 
we set it to the average human head diameter (« 19.2 cm) 
at the HRIR sampling frequency (44100 Hz). Visual HRIR 
examination reveals that most of the signal energy is indeed 
concentrated in the first 25 signal taps. 

Fig. [2] shows RMSE and SD error over 50 iterations of 
Algorithm [T] with no sparsity constraint on G (i.e. A = 0.0). 


The final filter / is a periodic, decaying functions resembling 
a typical HRIR plot. The final matrix G is shown in the top 
row of Fig. [3] The mean NNZE for G is 22.74 (it is less than 
K due to removal of all elements with magnitude less than 
10 -4 ). As it can be seen, the SD error achieved is 3.0 dB over 
the whole set of directions. 

In order to obtain the sparse HRIR representation, we re-ran 
the algorithm using identity transformation in Li-NNLS con¬ 
straint and a fixed A = 10 -3 (this parameter was determined 
empirically to cut the NNZE approximately in half). The final 
matrix G obtained in this case is shown in the bottom row of 
Fig. [3] It is sparse as expected and has a number of non-zero 
bands spanning the time-direction domain; thus, only the most 
salient components of G are retained. In this case, the mean 
NNZE is 11.48 and the SD error is 5.3 dB over the whole 
set of directions. In the following section, the guidelines for 
setting A are considered. 

D. Regularization Term Influence 

We investigate the effects of varying the A term in Eq. [14] 
under the identity transform T>i on the NNZE in G and on 
the RMSE / SD error. A sample HRIR is chosen randomly 
from the data set. Fig. [4] shows the effect of changing A on 
NNZE, RMSE, SD error, and reconstructed HRIR/HRTF per 
se. The trends that one can see in the figure are consistent 
with expectation; it is interesting to note that as A increases, 
low-magnitude elements in G are discarded whereas both the 
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Fig. 4. Influence of the L\ regularization term A in Eq[l4]on NNZE and on the reconstruction error for sample HRIR. 


dominant time-domain excitations and the shape of the spectral 
envelope in the reconstructed HRIR are preserved. 

Further analysis of the NNZE and of the SD error over the 
full set of HRIR measurement directions is shown in Fig. [5] 
Note that ipsilateral reflection filters have lower NNZE[J and 
achieve lower SD error. This is understandable, as they do fit 
better into a “resonance-plus-reflections” model implied in this 
work. On the other hand, contralateral HRIR reconstruction 
requires larger NNZE and results in more distortion, presum¬ 
ably due to significant reflections occuring later than K = 25 
time samples; note that while some effects of head shadowing 
(attenuation / time delay) are removed in the preprocessing 
step, others may not be modeled accurately; on the other 
hand, accurate HRIR reproduction on contralateral side is 

7 The variability exhibited can not be due simply to total HRIR energy 
differences as they were all normalized during pre-processing. 


not believed to be perceptually important l25l . Improvement 
in quality of contralateral HRIR reconstruction is a subject 
of future research. One approach is to learn separate HRIR 
decomposition, possibly with different length of fig filters, 
for different sub-regions of space. 

Finally, in Fig. [6] we compare the Li-NNLS reconstruction 
against the naive Li-LS reconstruction in terms of the convo¬ 
lution filter NNZE and SD error for varying A and a number 
of directions selected on horizontal and on medial planes. For 
all of these, the difference between solutions is less than 2.0 
dB SD; further, for 13 (out of 16) cases the Li-NNLS solution 
has the same or better reconstruction error than naive Li-LS 
solution in highly-sparse (NNZE < K/2) case. This implies 
that our decomposition is able to find a resonance filter and a 
sparse set of early reflections that represent the HRTF better 
than the dominant magnitude components of the original HRIR 
per se. 
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Azimuth 



Azimuth 


Fig. 5. A map of NNZE and SD error over the full spherical coordinate range 
for left-ear HRIR data. Note smaller NNZE / SD values on ipsilateral side. 


E. Transformation Bandwidth Optimization 

Further reduction of the SD error is possible via use of 
transform functions defined in section |III-D| Application of 
these functions would result in different weights placed on 
different aspects of reconstructed HRIR. Hence, we investigate 
the selection of bandwidth term a in Eq. JT6| with no L\ penalty 
term (A = 0) for the window transfornjj 

As mentioned before, application of the window transform 
T>w causes smoothing in the frequency domain; the amount of 
smoothing depends on the bandwidth term a. Fig. [7] shows the 
SD error dependence on a for one sample HRIR. Obviously 
as bandwidth a -A oo, the window transform becomes the 
identity transform; indeed, SD error stays constant for cr > 70. 
It can be seen though that the minimum SD error occurs at a 
finite a = 30 (for this particular HRIR). The parameter a can 
be efficiently fine-tuned (via fast search methods) separately 
for each HRIR in the subject’s HRTF set. Table [I] compares 
the SD error obtained over the grid of a = [15 + ((0 : 
24) * 2), 100,160, 250] using window transform to the SD 
error with identity transform (which is the same as window 
transform with a -A oo) across horizontal / median plane and 
over all HRTF set directions. It can be seen that on average, 
such tuning decreases the SD error by about 10%. 


F. Computational Cost 

Consider the cost of computing the i th sample of (x * y)i 
where * is the convolution operation. Direct time-domain 

8 We omit the convolution transform T>c in experiments as applying a low- 
pass filter to the residuals entails a per-frequency error metric. 


Reconstructed Log-magnitude HRTFs 


Reconstruction Error, Transformation D,„ 



-3.5 


-4.5 


20 40 60 80 100 

Frequency Bin 



Fig. 7. SD error dependence on bandwidth of window transform for a sample 
HRIR. 


TABLE I 

Mean spectral distortion for individually tuned V Wj(J 



H-plane 

M-plane 

All directions 

(t — y oo 

2.72 

1.73 

2.49 

Tuned cr 

2.53 

1.57 

2.24 


convolution requires min{|x| ,\y\} real floating-point opera¬ 
tions, where \x\ , \y\ is the NNZE in each filter. In practice, 
convolution is normally done in blocks of fixed size (so-called 
partitioned convolution). In case of time-domain processing, 
partitioned convolution incurs neither memory overhead nor 
latency. 

At the same time, the state-of-the-art frequency-domain im¬ 
plementation l26l requires yy(\y\ log 2 \y\ + |^|)/(|^| — \x\ +1) 
complex floating-point operations per output sample. For a 
long input signal (e.g. \y\ = 44100 - i.e. one second at CD 
audio quality), time-domain algorithm is faster than frequency - 
domain implementation for \x\ < 127. Further, in real-time 
processing, latency becomes an issue, and one must use parti¬ 
tioned convolution (with reasonably small block size) and the 
overlap-and-save algorithm (27 ). In order to achieve e.g. 50 
ms latency, one must have \y\ = 2205. For this segment length, 
direct time-domain convolution incurs less computational cost 
when \x\ < 90. Thus, a time-domain convolution using sparse 
filter x as derived in this paper is arguably quite beneficial to 
the computational load incurred by the VAD engine. 

V. Discussion 

While our study presents the theoretical derivation of our 
factorization algorithm, a number of practical concerns have 
been omitted for reasons of scope. We provide a number of 
remarks on these below. 

First, an optimal NNZE is hardware dependent, as the 
crossover point between time-domain and frequency-domain 
convolution costs depends on the computational platform as 
well as on the specific implementations of both. For example, 
specialized digital signal processors can perform efficient real 
time-domain convolution via hardware delay lines whereas 
being less optimized for handling complex floating-point op¬ 
erations necessary for fast Fourier transform. 
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Second, the target reconstruction error can be adjusted to 
match a desired fidelity of spatialization. For instance, early 
reflections off nearby environmental features may have to be 
spatialized more distinctly than a number of low-magnitude 
later reflections that collectively form the reverberation tail. 
Further, the need to individually optimize the penalty term 
A for each direction depends also on desired sparsity (i.e. 
computational load) versus SD error trade-off. Such real-time 
load balancing is an open challenge that depends on available 
computational resources on specific hardware platform. 

Certain obvious extensions of the work presented has also 
not been fully described for clarity. We note that using non¬ 
zero A term and varying the bandwidth a in T>w, F>c trans¬ 
forms could lead to decrease in SD error at the same NNZE 
when tuned. A set of bandpass transformations that constitute 
the orthogonal basis for the discrete Fourier transform could 
also be used, as in this case the error could be weighted 
individually in each frequency band to match the listener’s 
characteristics (e.g. by using the equal loudness contours in 
frequency). 

Another consideration is the choice of the cost function in 
Eq. [3j which currently omits prior information on the HRIR 
measurement direction distribution. It may be undesirable to 
place equal weight on all directions if those are in fact spaced 
non-uniformly. Instead, the sample residual can be biased 
by introducing a kernel transformation V E R iVxiV of the 
HRIR measurement directions (V l: j is a kernel function eval¬ 
uation between directions i th and j th ) into the cost function 
tr ((X — FG t )'D~ 1 (X — FG t ) t ), which would decorrelate 
HRIR reconstruction error in densely-sampled area and thus 
avoid giving preferential treatment to these areas while opti¬ 
mizing. 

VI. Conclusions 

We have presented a modified semi-NMF matrix factor¬ 
ization algorithm for Toeplitz constrained matrices. The fac¬ 
torization represent each HRIR in a collection as a con¬ 
volution between a common “resonance filter” and specific 
“reflection filter”. The resonance filter has mixed sign, is 
direction-independent, and is of length comparable to original 
HRIR length. The reflection filter is non-negative, direction- 
dependent, short, and sparse. The tradeoff between sparsity 
and approximation error can be tuned via the regularization 
parameter of Li-NNLS solver, which also has the ability 
to place different weights on errors in different frequency 
bands (for HRTF) or at different time instants (for HRIR). 
Comparison between HRIR reconstructed using the proposed 
algorithm and Li-LS reference solution shows that the former 
has much better sparsity-to-error tradeoff, thus allowing for 
high-fidelity latency-free spatial sound presentation at very low 
computational cost. 
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Azimuth = -3.0543 






Azimuth = -2.4435 Azimuth 4> = —1.1345 









Fig. 6. A comparison between varying-sparsity L i-NNLS and L i-LS solutions for selected directions on horizonal and median planes. Angles are listed in 
radians. 





































































































































